Make NVTE tensor handle pool size configurable by lhb8125 · Pull Request #3090 · NVIDIA/TransformerEngine

lhb8125 · 2026-06-05T07:17:06Z

Summary

Add runtime environment variables to configure the internal NVTETensor and NVTEGroupedTensor handle pool sizes.
Keep the default pool size at 20 MiB to preserve existing behavior.
Improve exhaustion errors to mention the configured pool size and the env var to increase.

Motivation

Large model initialization paths can legitimately create more TE tensor handles than the current fixed-size pool allows, even when GPU and CPU memory are otherwise sufficient. Exposing the pool size as an environment variable avoids downstream source patches for these scale-dependent cases.

Testing

git diff --check

Signed-off-by: hongbinl <hongbinl@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-06-05T07:24:28Z

Greptile Summary

This PR exposes two new runtime environment variables (NVTE_TENSOR_HANDLE_POOL_SIZE_MB and NVTE_GROUPED_TENSOR_HANDLE_POOL_SIZE_MB) to make the internal handle pool sizes configurable, preserving the existing 20 MiB default and improving exhaustion error messages to include the configured size and the relevant env var name.

Adds GetTensorHandlePoolSizeMB and GetTensorHandlePoolCapacity helpers in an anonymous namespace that read, validate, and convert env var values to pool capacities; the computed constants replace the previously hardcoded 20 * 1024 * 1024 / sizeof(T) expressions as member initializers in both allocator singletons.
Updates docs/envvars.rst with documentation entries for the two new variables.

Confidence Score: 4/5

Safe to merge; the change is backward-compatible (default pool size unchanged) and the new code paths are straightforward.

The core logic is correct — validation guards against zero and overflow, the member-initializer ordering is right, and the singleton construction is safe. Two minor defects exist: partial string inputs like "50bad" are silently accepted as 50 due to no eof check in the shared getenv helper, and negative env var values wrap to SIZE_MAX and trigger the "too large" error rather than the intended "must be a positive integer" message. Both are input-validation edge cases with no correctness impact under documented usage.

transformer_engine/common/transformer_engine.cpp — specifically the env-var parsing and validation in GetTensorHandlePoolSizeMB

Important Files Changed

Filename	Overview
transformer_engine/common/transformer_engine.cpp	Adds anonymous-namespace helpers to read and validate pool-size env vars; replaces hardcoded constants in TensorAllocator and GroupedTensorAllocator with env-var-driven member initializers. Logic is correct but partial string parsing (no eof check) can silently accept values like "50bad" as 50, and negative inputs produce a confusing "too large" error due to size_t unsigned wrapping.
docs/envvars.rst	Adds documentation for both new env vars with correct defaults and description; consistent with the surrounding RST style.

Sequence Diagram

sequenceDiagram
    participant Env as Environment Variable
    participant Helper as GetTensorHandlePoolSizeMB
    participant Cap as GetTensorHandlePoolCapacity
    participant Alloc as TensorAllocator (singleton)
    participant App as Application Code

    App->>Alloc: first call to instance()
    activate Alloc
    Alloc->>Env: std::getenv("NVTE_TENSOR_HANDLE_POOL_SIZE_MB")
    Env-->>Helper: raw string value (or null default 20)
    Helper->>Helper: parse via getenv size_t
    Helper->>Helper: "NVTE_CHECK pool_size_mb > 0"
    Helper->>Helper: NVTE_CHECK pool_size_mb within bounds
    Helper-->>Cap: pool_size_mb
    Cap->>Cap: compute pool_size_bytes
    Cap->>Cap: "NVTE_CHECK pool_size_bytes >= sizeof(Tensor)"
    Cap-->>Alloc: MAX_TENSOR_NUM
    Alloc->>Alloc: memory.reserve(MAX_TENSOR_NUM)
    Alloc-->>App: singleton ready
    deactivate Alloc

    App->>Alloc: Allocate(mode, out, N)
    Alloc->>Alloc: "check available >= N"
    alt pool exhausted
        Alloc-->>App: NVTE_CHECK error with pool_size_mb and env var hint
    else space available
        Alloc-->>App: NVTETensor handles
    end

_{Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..." | Re-trigger Greptile}

Signed-off-by: hongbinl <hongbinl@nvidia.com>

for more information, see https://pre-commit.ci

ptrendx · 2026-06-09T15:42:16Z

I am not opposed to creating such a variable, but I would really like to see an example of such legitimate use which goes over this limit. Could you run the experiment that is failing for you with https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/transformer_engine.cpp#L487 set to true and send me the log of that?

Make NVTE tensor handle pool size configurable

ec48388

Signed-off-by: hongbinl <hongbinl@nvidia.com>

github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label Jun 5, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

c3a3285

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread transformer_engine/common/transformer_engine.cpp Outdated

Comment thread transformer_engine/common/transformer_engine.cpp

lhb8125 marked this pull request as draft June 5, 2026 08:17

lhb8125 and others added 2 commits June 8, 2026 08:22

Validate tensor handle pool env vars

5c9ee29

Signed-off-by: hongbinl <hongbinl@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

935054d

for more information, see https://pre-commit.ci

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make NVTE tensor handle pool size configurable#3090

Make NVTE tensor handle pool size configurable#3090
lhb8125 wants to merge 4 commits into
NVIDIA:mainfrom
lhb8125:codex/nvte-tensor-pool-env-var

lhb8125 commented Jun 5, 2026

Uh oh!

greptile-apps Bot commented Jun 5, 2026

Uh oh!

Uh oh!

Uh oh!

ptrendx commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lhb8125 commented Jun 5, 2026

Summary

Motivation

Testing

Uh oh!

greptile-apps Bot commented Jun 5, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

ptrendx commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants